Improving English-Spanish Statistical Machine Translation: Experiments in Domain Adaptation, Sentence Paraphrasing, Tokenization, and Recasing

نویسنده

  • Preslav Nakov
چکیده

We describe the experiments of the UC Berkeley team on improving English-Spanish machine translation of news text, as part of the WMT’08 Shared Translation Task. We experiment with domain adaptation, combining a small in-domain news bi-text and a large out-of-domain one from the Europarl corpus, building two separate phrase translation models and two separate language models. We further add a third phrase translation model trained on a version of the news bi-text augmented with monolingual sentencelevel syntactic paraphrases on the sourcelanguage side, and we combine all models in a log-linear model using minimum error rate training. Finally, we experiment with different tokenization and recasing rules, achieving 35.09% Bleu score on the WMT’07 news test data when translating from English to Spanish, which is a sizable improvement over the highest Bleu score achieved on that dataset at WMT’07: 33.10% (in fact, by our system). On the WMT’08 English to Spanish news translation, we achieve 21.92%, which makes our team the second best on Bleu score.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

NUS at WMT09: Domain Adaptation Experiments for English-Spanish Machine Translation of News Commentary Text

We describe the system developed by the team of the National University of Singapore for English to Spanish machine translation of News Commentary text for the WMT09 Shared Translation Task. Our approach is based on domain adaptation, combining a small in-domain News Commentary bi-text and a large out-of-domain one from the Europarl corpus, from which we built and combined two separate phrase t...

متن کامل

Improving Word Alignment with Language Model Based Confidence Scores

This paper describes the statistical machine translation systems submitted to the ACL-WMT 2008 shared translation task. Systems were submitted for two translation directions: English→Spanish and Spanish→English. Using sentence pair confidence scores estimated with source and target language models, improvements are observed on the NewsCommentary test sets. Genre-dependent sentence pair confiden...

متن کامل

Statistical Post-Editing of Machine Translation for Domain Adaptation

This paper presents a statistical approach to adapt out-of-domain machine translation systems to the medical domain through an unsupervised post-editing step. A statistical post-editing model is built on statistical machine translation (SMT) outputs aligned with their translation references. Evaluations carried out to translate medical texts from French to English show that an out-of-domain mac...

متن کامل

Wiki-Translator: Multilingual Experiments for In-Domain Translations

The benefits of using comparable corpora for improving translation quality for statistical machine translators have been already shown by various researchers. The usual approach is starting with a baseline system, trained on out-of-domain parallel corpora, followed by its adaptation to the domain in which new translations are needed. The adaptation to a new domain, especially for a narrow one, ...

متن کامل

Combining Domain Adaptation Approaches for Medical Text Translation

This paper explores a number of simple and effective techniques to adapt statistical machine translation (SMT) systems in the medical domain. Comparative experiments are conducted on large corpora for six language pairs. We not only compare each adapted system with the baseline, but also combine them to further improve the domain-specific systems. Finally, we attend the WMT2014 medical summary ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2008